library(tidyverse)
Scatter plots (using geom_point()) are intuitive, easily
understood, and very common, but we must always consider
overplotting, particularly in the following four
situations:
Large datasets
Aligned values on a single axis
Low-precision data
Integer data
Typically, alpha blending (i.e. adding transparency) is recommended when using solid shapes. Alternatively, you can use opaque, hollow shapes.
Small points are suitable for large datasets with regions of high density (lots of overlapping).
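As a rough illustration of the two recommendations above, the following sketch draws the same data once with semi-transparent solid circles and once with opaque hollow circles. The random subsample of diamonds, its size of 2000 rows, the seed, and the names diamonds_sample, plt_alpha and plt_hollow are assumptions made for this example only.

```r
library(ggplot2)

# Hypothetical subsample: 2000 rows is an arbitrary choice to keep the plots quick
set.seed(1)
diamonds_sample <- diamonds[sample(nrow(diamonds), 2000), ]

# Option 1: solid circles (shape 16) with alpha blending
plt_alpha <- ggplot(diamonds_sample, aes(carat, price)) +
  geom_point(shape = 16, alpha = 0.3)

# Option 2: opaque, hollow circles (shape 1), no transparency needed
plt_hollow <- ggplot(diamonds_sample, aes(carat, price)) +
  geom_point(shape = 1)

plt_alpha
plt_hollow
```

With solid shapes, dense regions appear darker as points stack up; with hollow shapes, overlapping outlines stay individually visible.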
Let’s use the diamonds dataset to practice dealing with
the large dataset case.
Set the point transparency (alpha) to 0.5.
Set shape = ".", which gives a point size of 1 pixel.
# Plot price vs. carat, colored by clarity
plt_price_vs_carat_by_clarity <- ggplot(diamonds, aes(carat, price, color = clarity))
# Add a point layer with tiny points
plt_price_vs_carat_by_clarity +
  geom_point(alpha = 0.5, shape = ".")
Change the point shape to 16.
# Add a point layer with solid circles (shape 16)
plt_price_vs_carat_by_clarity +
  geom_point(alpha = 0.5, shape = 16)
Let’s take a look at another case where we should be aware of overplotting: aligned values on a single axis.
This occurs when one axis is continuous and the other is categorical, which can be overcome with some form of jittering.
In the mtcars2 data set created below, fcyl and fam are categorical variants of cyl and am, respectively.
mtcars2 <- mtcars %>%
  mutate(fcyl = factor(cyl),
         fam = ifelse(am == 0, "automatic", "manual"))
Create a base plot, plt_mpg_vs_fcyl_by_fam, of mpg vs. fcyl, colored by fam.
# Plot base
plt_mpg_vs_fcyl_by_fam <- ggplot(mtcars2, aes(fcyl, mpg, color = fam))
# Default points are shown for comparison
plt_mpg_vs_fcyl_by_fam +
  geom_point()
Jitter the points using position_jitter(), setting the width to 0.3.
# Alter the point positions by jittering, width 0.3
plt_mpg_vs_fcyl_by_fam +
  geom_point(position = position_jitter(width = 0.3))
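One detail worth keeping in mind: when height is not given, position_jitter() also adds a small amount of vertical noise (the ggplot2 documentation describes the default as 40% of the resolution of the data). If the continuous values should stay exact, a sketch like the following restricts the jitter to the categorical axis. The name plt_jitter_x_only and the seed value are assumptions for illustration, and mtcars2 is rebuilt here only so the block runs on its own.

```r
library(ggplot2)
library(dplyr)

# Rebuild mtcars2 so this example is self-contained
mtcars2 <- mtcars %>%
  mutate(fcyl = factor(cyl),
         fam = ifelse(am == 0, "automatic", "manual"))

# Jitter horizontally only: height = 0 leaves mpg untouched.
# seed (an arbitrary value here) makes the random offsets reproducible.
plt_jitter_x_only <- ggplot(mtcars2, aes(fcyl, mpg, color = fam)) +
  geom_point(position = position_jitter(width = 0.3, height = 0, seed = 42))

plt_jitter_x_only
```

Fixing the seed means the plot looks the same each time it is rendered, which helps when comparing versions of a figure.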
Now use position_jitterdodge(). Set jitter.width and dodge.width to 0.3 to separate subgroups further.
# Now jitter and dodge the point positions
plt_mpg_vs_fcyl_by_fam +
  geom_point(position = position_jitterdodge(jitter.width = 0.3, dodge.width = 0.3))
You already saw how to deal with overplotting when using geom_point()
in two cases:
Large datasets
Aligned values on a single axis
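For large datasets we used transparency and tiny points. For aligned values we jittered the points, which can be done either with position = "jitter" inside geom_point() or with geom_jitter().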
Let’s take a look at another case: low-precision data.
This results from low-resolution measurements, as in the iris dataset, which is measured to 1 mm precision. It’s similar to case 2, but here we can jitter on both the x and y axes.
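To see why low-precision measurements lead to overplotting, one can count how many iris observations share exactly the same coordinates. This is only a quick diagnostic sketch; the names iris_overlaps and n_obs are made up for the example.

```r
library(dplyr)

# Count how many observations fall on each (Sepal.Length, Sepal.Width) pair
# and keep only the pairs that occur more than once
iris_overlaps <- iris %>%
  count(Sepal.Length, Sepal.Width, name = "n_obs") %>%
  filter(n_obs > 1) %>%
  arrange(desc(n_obs))

head(iris_overlaps)
```

Every row in the result is a position where a plain point layer would draw several observations on top of each other.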
Change the points layer into a jitter layer.
Reduce the jitter layer’s width by setting the width
argument to 0.1.
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  # Swap for jitter layer with width 0.1
  geom_jitter(alpha = 0.5, width = 0.1)
Let’s use a different approach: within geom_point(), set position to "jitter".
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  # Set the position to jitter
  geom_point(position = "jitter", alpha = 0.5)
Provide an alternative specification: have the position argument call position_jitter() with a width of 0.1.
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  # Use a jitter position function with width 0.1
  geom_point(alpha = 0.5, position = position_jitter(width = 0.1))
Let’s take a look at the last case of dealing with overplotting: integer data.
This can be variables of type integer (e.g. 1, 2, 3, …) or categorical variables (i.e. class factor). A factor is just a special class built on type integer.
You’ll typically have a small, defined number of intersections between two variables, which is similar to case 3, but you may miss it if you don’t realize that integer and factor data behave like low-precision data.
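The relationship between factors and integers can be checked directly in base R. The small factor f below is a made-up example, not part of the datasets used here.

```r
# A small, hypothetical factor with three categories
f <- factor(c("low", "high", "medium", "low"))

typeof(f)      # storage type: "integer"
class(f)       # class: "factor"
levels(f)      # "high" "low" "medium" (sorted alphabetically by default)
as.integer(f)  # 2 1 3 2 (each value's position in levels(f))
```

Because each category is stored as an integer code, a factor can only ever take a small, fixed set of positions on an axis, just like integer data.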
The Vocab dataset provided contains the years of
education and vocabulary test scores from respondents to US General
Social Surveys from 1972-2004.
# Vocab data
library(car)
Examine the Vocab dataset using
str().
Using Vocab, draw a plot of vocabulary
vs education.
Add a point layer.
# Examine the structure of Vocab
str(Vocab)
## 'data.frame': 30351 obs. of 4 variables:
## $ year : num 1974 1974 1974 1974 1974 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 1 1 1 2 2 2 1 1 ...
## $ education : num 14 16 10 10 12 16 17 10 12 11 ...
## $ vocabulary: num 9 9 9 5 8 8 9 5 3 5 ...
## - attr(*, "na.action")= 'omit' Named int [1:32115] 1 2 3 4 5 6 7 8 9 10 ...
## ..- attr(*, "names")= chr [1:32115] "19720001" "19720002" "19720003" "19720004" ...
# Plot vocabulary vs. education
ggplot(Vocab, aes(education, vocabulary)) +
  # Add a point layer
  geom_point()
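To get a sense of how severe the overplotting in this plot is, one could compare the number of rows with the number of distinct plotting positions. This sketch assumes that Vocab is available from the carData package (which current versions of car load as a dependency); n_positions is a name chosen for the example.

```r
library(carData)  # assumption: Vocab is provided by carData
library(dplyr)

# Number of distinct (education, vocabulary) combinations, i.e. the
# number of visibly different positions a plain point layer can show
n_positions <- Vocab %>%
  distinct(education, vocabulary) %>%
  nrow()

n_positions

# Average number of observations stacked on each position
nrow(Vocab) / n_positions
```

A high ratio suggests that most observations are hidden behind others, which is what jittering and transparency are meant to reveal.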
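Replace the point layer with a jitter layer.
ggplot(Vocab, aes(education, vocabulary)) +
  geom_jitter()
Set the jitter transparency to 0.2.
ggplot(Vocab, aes(education, vocabulary)) +
  geom_jitter(alpha = 0.2)
Set the shape of the jittered points to hollow circles (shape 1).
ggplot(Vocab, aes(education, vocabulary)) +
  geom_jitter(alpha = 0.2, shape = 1)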